† Corresponding author. E-mail:
Information entropy has been proved to be an effective tool to quantify the structural importance of complex networks. In a previous work [Xu et al. Physica A,
In the field of network science, real-world complex systems are abstracted as complex networks, in which nodes represent individuals and links denote the connections or interactions between individuals.[1–3] Nowadays, although we can obtain abundant data of various complex systems due to advanced technologies, it is demonstrated that large parts of the data of the complex systems are still not available, and there are non-ignorable errors in the data that we obtain.[4,5] Thus, new methods are needed to process, correct, and make predictions from the data. Link prediction methods aim to predict the missing or future links among network data.[6,7] Specifically, they estimate the existence likelihood of links between two nodes based on observed links and nodes’ attributes. Link prediction has broad applications,[8] such as detecting potential interactions in protein–protein interaction network,[9] recommending friends and goods in online social networks,[10] exploring potential coauthor relationships in collaboration networks,[11] and so on.
Previous algorithms are basically from the field of machine learning, including supervised learning,[12] Markov chain,[13] and likelihood estimation.[14] These algorithms heavily depend on the attributes of nodes, and do not seriously consider the structural characteristics of networks. Besides, their computation cost is inhibitive for large real-world networks.[15] Recently, the booming network science community has achieved deeper insights into the structure of complex networks,[1] and further stimulates the research of link prediction.[6,16] Lots of prediction algorithms based on structural similarity are proposed, which can be classified into three types: local indices,[17–22] quasi-local indices,[23,24] and global indices.[25,26] For example, common neighbors (CN),[18] preferential attachment (PA),[17] Adamic–Adar (AA),[21] and resource allocation (RA)[22] are local indices, which only use the nearest neighborhood information. Katz,[25] Leicht–Holme–Newman (LHN),[26] and SimRank,[27] which need the knowledge of the whole network topology, are global ones. Quasi-local indices, including local path (LP),[23] local random walk (LRW),[24] and superposed random walk (SRW),[24] need more topological information than local indices, but less topological information than global ones. Generally, local indices have the lowest prediction accuracy, but their computational cost is the smallest among the three types of similarity indices. Global indices are the opposite of local indices, while quasi-local ones are the trade-off between them.
Recently, information entropy has been employed to measure the complexity of the topological structures of complex networks.[28] Results showed that information entropy can better capture the topological difference than the other typical network measurements.[29,30] Moreover, information entropy has a natural connection with link prediction problems, in which the probability of a missing link between two nodes can be transformed into the corresponding information entropy. Thus, researchers began to apply information entropy theory to link prediction problems in complex networks.[31–33] For example, Tan et al.[31] reexamined the role of common neighbors in link prediction by using the mutual information, and proposed a mutual information-based similarity index. Xu et al.[32] derived the information entropy of a path, and studied the contributions of paths in link prediction based on path entropy, and finally provided a path entropy based similarity index. Simulation results[31–33] showed that the similarity indices based on information entropy have higher prediction accuracy than the other types of similarity indices.
In real society, the interactions between individuals are of different strengths for many complex networks, which are called weighted networks.[2,34] For instance, in the air traffic network, the weight of a link is measured by the number of passengers in the related flight. In the router-level of the Internet, the weights of links are generally correlated with the bandwidth of the physical connections or the cost for data transmission between routers. In social networks, weights of links are related to the interacting times or frequencies between individuals. Therefore, it is reasonable to consider the link weights when designing link prediction algorithms to further improve the prediction accuracy. So far there have been a few tries in the literature. Murata and Moriyasu[35] improved the CN and AA indices by using the link weights information. Bai et al.[36] developed the weighted version of the LP index. Lü et al.[37] particularly explored the role of weak ties in link prediction and found that weak ties can improve the prediction accuracy effectively. Simulation results showed that the weighted version of these typical similarity indices indeed have a higher accuracy than the original ones. However, they still have a large space to improve. For example, the weighted and unweighted versions of AA and CN indices only utilize the information of paths with length 2, which can be further improved by considering longer paths, such as paths with length 3. In addition, although the LP index considers paths of length 3, it only cares about the number of paths for unweighted networks or the sum of link weights for weighted networks, while the topological information of those paths can be further explored to improve the prediction accuracy.
In this paper, we apply information entropy to link prediction in weighted networks. Specifically, we reexamine the role of paths in link prediction by considering both the path entropy and the path weight, and propose a weighted similarity index applicable to the weighted networks. Through simulation on real-world weighted networks, we show the prediction accuracy of our index and make comparisons with other typical weighted indices.
Quantitative measurement of complex networks is a hot topic in network science,[1] and various measurements are proposed, including degree, betweenness, closeness, k-core number, and so on.[2] However, most of the measurements are for nodes and links, and only a very few are particularly for paths, such as path length[38] and path attack centrality.[39] In our latest work,[32] we applied information theory to measure the importance of paths, and specifically we derived the entropy of a path. Assuming that L ab 1 (L ab 0) represents the event that there is (not) a link between nodes a and b in the network without degree correlation. Then, the probability of L ab 1 is calculated by
(1) |
(2) |
For a simple path
(3) |
(4) |
Although the weights of links and weighted networks have been well discussed, there is no specific result about the weights of paths. In traffic routing protocols, we usually choose the optimal path, of which the sum of the cost of all links is the smallest among the candidate paths.[40–42] Enlightened by this, here we define the weight of path D as the sum of its links’ weights,
(5) |
In the framework of information theory, the probability of link existence between two nodes can be expressed with information entropy. Then, the link prediction problem can be defined as the conditional entropy,[31] that is
(6) |
Previous results showed that the longer the path is, the less important the path is in link prediction. Moreover, results demonstrated that the link weights can be used to improve the prediction accuracy. Here we combine the path length, path weight, and path entropy to calculate the contribution of path D with length i in the link prediction:
(7) |
(8) |
(9) |
Assuming an undirected network
Two standard metrics, AUC and precision, are applied to quantify the prediction quality. To calculate AUC, n times of independent score comparisons are made between node pairs in
(10) |
(11) |
Next, we introduce three indices and their weighted versions that we use for comparison. They are common neighbors (CN),[18] Adamic–Adar index (AA),[21] and local path (LP),[23] which are defined as
(12) |
(13) |
(14) |
(15) |
(16) |
(17) |
Six weighted networks from disparate fields are used to test the accuracy for various prediction indices. The directions of links are ignored. The self-connections and multiple links are deleted from the network data. For the unconnected networks, we select the maximum connected components for experiments. The statistics of these networks are summarized in Table
To investigate the ability of our prediction index, we perform experiments on the six weighted networks, and make comparisons with three other typical indices. Both the unweighted and weighted versions of these indices are tested. The link weights of the networks are only considered when calculating the weighted versions of these indices. In addition, for the weighted versions of these indices, we adjust the control parameter α to achieve the near optimal performances measured by AUC and precision. For the PE and WPE indices, l = 2 means only paths with length of 2 are used in the calculation, while l = 3 indicates that paths with lengths of both 2 and 3 are used in the calculation. In theory, we can use the information of all paths in link prediction, but in real experiments, we usually consider only short paths, since short paths are more important than long paths in link prediction. On the other side, the longer the paths are, the larger the computational cost is. Our simulation results are shown in Tables
Tables
Figure
In summary, we study the link prediction problem in weighted complex networks from the perspective of information entropy. In fact, the likelihood of a link between two nodes can be converted into entropy, and small entropy corresponds to large probability of link existence. In this paper we consider the contributions of simple paths between node pairs in link prediction. Specifically, we measure the contribution of an existing path by considering its length, entropy, and weight which is further defined as the sum of link weights in the path. Furthermore, we propose a weighted path entropy (WPE) index for link prediction by considering the contributions of all existing simple paths. Through simulation on several real-world weighted networks, we find that WPE has higher prediction ability measured by AUC and precision than the other typical weighted similarity indices. In fact, the PE index proposed in our previous work already has high prediction accuracy. Thus, by appropriately considering the path weight, WPE further improves the prediction accuracy. We also find that a weak tie is more critical than a strong tie in link prediction. Note that in our context, a weak tie refers to a path with small weight, while in the other works a weak tie means a small weight link.
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 | |
23 | |
24 | |
25 | |
26 | |
27 | |
28 | |
29 | |
30 | |
31 | |
32 | |
33 | |
34 | |
35 | |
36 | |
37 | |
38 | |
39 | |
40 | |
41 | |
42 | |
43 | |
44 | |
45 | |
46 | |
47 | |
48 |